Over 370,000 used cars were scraped with Scrapy from Ebay-Kleinanzeigen. The content of the data is in German, so it has to be translated first if you do not speak German. The following fields are included:
dateCrawled : when this ad was first crawled, all field-values are taken from this date
name : “name” of the car
seller : private or dealer
offerType
price : the asking price on the ad (currency is EUR; roughly 1 EUR = 1.06 USD)
abtest
vehicleType
yearOfRegistration : the year in which the car was first registered
gearbox
powerPS : power of the car in PS (metric horsepower; 1 PS = 4500 kilogram-force metres per minute, about 735.5 W)
model
kilometer : how many kilometers the car has driven (1 kilometer = 0.621371 miles)
monthOfRegistration : the month in which the car was first registered
fuelType
brand
notRepairedDamage : whether the car has damage that has not been repaired yet
dateCreated : the date on which the ad on eBay was created
nrOfPictures : number of pictures in the ad
postalCode
lastSeenOnline : when the crawler saw this ad last online
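A minimal sketch of how a structure listing like the one that follows could be produced. The three inline rows and their timestamps are illustrative stand-ins, and the file name `autos.csv` mentioned in the comment is an assumption, not something stated in this report.

```r
# Toy stand-in for the real file; the rows below are illustrative only.
csv_text <- "dateCrawled,seller,price,yearOfRegistration,gearbox,powerPS,kilometer,lastSeen
2016-03-24 11:52:17,privat,480,1993,manuell,0,150000,2016-04-07 03:16:57
2016-03-24 10:58:45,privat,18300,2011,manuell,190,125000,2016-04-07 01:46:50
2016-03-14 12:52:21,privat,9800,2004,automatik,163,125000,2016-04-05 12:47:46"

# With the real data this would be something like
# read.csv("autos.csv", stringsAsFactors = TRUE); the file name is an assumption.
cars <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Timestamps are read as factors; convert them before doing date arithmetic.
cars$dateCrawled <- as.POSIXct(as.character(cars$dateCrawled), tz = "UTC")
cars$lastSeen    <- as.POSIXct(as.character(cars$lastSeen), tz = "UTC")

# Compact overview of column types and first values.
str(cars)
```

Reading text columns as factors matches the `Factor w/ ... levels` entries in the listing; converting the timestamp columns is only needed if they are used for calculations such as selling time.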
## 'data.frame': 189349 obs. of 20 variables:
## $ dateCrawled : Factor w/ 164590 levels "2016-03-05 14:06:22",..: 96234 96080 44613 62122 135465 158918 142369 82743 161561 60299 ...
## $ name : Factor w/ 128113 levels "<U+0096>_ein_Kombi_+_4WD__perfekt_fuer_alle_Lebenslagen__",..: 43604 2400 50197 44742 94997 15678 80970 118441 34980 119726 ...
## $ seller : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
## $ offerType : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 480 18300 9800 1500 3600 650 2200 0 14500 999 ...
## $ abtest : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
## $ vehicleType : Factor w/ 9 levels "","andere","bus",..: 1 5 9 6 6 8 4 8 3 6 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 3 levels "","automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ powerPS : int 0 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 251 levels "","1_reihe","100",..: 119 1 120 119 104 13 9 42 58 119 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: int 0 5 8 6 7 10 8 7 8 0 ...
## $ fuelType : Factor w/ 8 levels "","andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 3 levels "","ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
## $ dateCreated : Factor w/ 97 levels "2014-03-10 00:00:00",..: 83 83 73 76 90 94 91 80 94 76 ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
## $ lastSeen : Factor w/ 111190 levels "2016-03-05 14:15:08",..: 107390 106918 92424 25353 100813 104198 94797 50176 88971 71117 ...
## price vehicleType yearOfRegistration
## Min. : 0 limousine :48664 Min. :1000
## 1st Qu.: 1150 kleinwagen:40740 1st Qu.:1999
## Median : 2950 kombi :34464 Median :2003
## Mean : 10895 :19421 Mean :2005
## 3rd Qu.: 7200 bus :15520 3rd Qu.:2008
## Max. :99999999 cabrio :11662 Max. :9999
## (Other) :18738
## gearbox powerPS model kilometer
## : 10384 Min. : 0.0 golf : 15276 Min. : 5000
## automatik: 39191 1st Qu.: 70.0 andere : 13447 1st Qu.:125000
## manuell :139634 Median :105.0 3er : 10518 Median :150000
## Mean :112.1 : 10382 Mean :125639
## 3rd Qu.:150.0 polo : 6709 3rd Qu.:150000
## Max. :999.0 corsa : 6410 Max. :150000
## (Other):126467
## fuelType brand
## benzin :114031 volkswagen :40650
## diesel : 54936 bmw :20525
## : 16907 opel :20409
## lpg : 2731 mercedes_benz:17918
## cng : 306 audi :16668
## hybrid : 142 ford :13085
## (Other): 156 (Other) :59954
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 1.15e+03 2.95e+03 1.09e+04 7.20e+03 1.00e+08
As we can see, there are many outliers. For example, no car should be listed for free (price 0), and the upper bound (about 1e8) is implausibly high. So we need to clean the price data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 209 1299 2990 4611 6559 19790
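The report does not state which cut-offs were used to clean the price column. One possible approach, shown here only as a sketch, is to trim at hypothetical 5% and 95% quantiles; the toy prices are illustrative.

```r
# Toy prices, including a free listing and an absurdly high one.
cars <- data.frame(
  price = c(0, 1, 150, 480, 650, 999, 1500, 2200, 3600,
            9800, 14500, 18300, 99999999)
)

# Hypothetical trimming bounds; the cut-offs actually used are not stated.
bounds <- quantile(cars$price, probs = c(0.05, 0.95))

# Keep only rows whose price lies inside the bounds.
cars_clean <- subset(cars, price >= bounds[1] & price <= bounds[2])

summary(cars_clean$price)
```

Quantile trimming adapts to the data, whereas fixed limits (for example a minimum and maximum price in EUR) are easier to explain; either choice would need to be justified against the real distribution.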
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1999 2003 2004 2008 9999
The same applies to this column: the registration year cannot be later than 2016, the year the data was collected. Also, the first car was invented in 1885 (which is still too low to be a realistic lower bound), so nothing can have been registered before that year.
After the cleaning process:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1951 1999 2003 2002 2007 2015
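A minimal sketch of the year filter described above, using the two limits named in the text (1885 and 2016) on toy values. The cleaned summary starts at 1951, which suggests a stricter lower bound was actually applied; that bound is not stated, so it is not reproduced here.

```r
# Toy registration years, including the impossible values seen in the summary.
cars <- data.frame(
  yearOfRegistration = c(1000, 1910, 1951, 1993, 2004, 2015, 2018, 9999)
)

min_year <- 1885  # year of the first car, as mentioned in the text
max_year <- 2016  # year the data was collected

# Keep only plausible registration years.
cars_clean <- subset(
  cars,
  yearOfRegistration >= min_year & yearOfRegistration <= max_year
)

summary(cars_clean$yearOfRegistration)
```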
Let's create the histograms:
## andere bus cabrio coupe kleinwagen
## 4623 1502 14394 9945 7707 38039
## kombi limousine suv
## 32015 44531 5804
I find it strange that there are more limousines than any other kind of car (note that Limousine is the German word for a sedan).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 75.0 105.0 110.8 143.0 953.0
Cleaning
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 75.0 105.0 110.6 143.0 490.0
## alfa_romeo audi bmw chevrolet chrysler
## 1041 13535 17128 811 649
## citroen dacia daewoo daihatsu fiat
## 2294 431 233 327 4202
## ford honda hyundai jaguar jeep
## 11002 1232 1637 222 305
## kia lada lancia land_rover mazda
## 1124 89 198 265 2550
## mercedes_benz mini mitsubishi nissan opel
## 14805 1493 1295 2197 17150
## peugeot porsche renault rover saab
## 5112 348 7693 208 240
## seat skoda smart sonstige_autos subaru
## 3007 2658 2540 1175 333
## suzuki toyota trabant volkswagen volvo
## 1059 2172 211 34001 1519
Volkswagen looks like the most popular brand in the market. Let's check which Volkswagen models are on the market.
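A minimal sketch of how the Volkswagen models could be tabulated, assuming a data frame with `brand` and `model` columns as in the dataset. The six rows are toy data.

```r
# Toy data with the same column names as the dataset.
cars <- data.frame(
  brand = factor(c("volkswagen", "volkswagen", "bmw",
                   "volkswagen", "opel", "volkswagen")),
  model = factor(c("golf", "polo", "3er", "golf", "corsa", "golf"))
)

# Keep only Volkswagen rows and drop factor levels that no longer occur.
vw <- droplevels(subset(cars, brand == "volkswagen"))

# Count models, most frequent first.
vw_models <- sort(table(vw$model), decreasing = TRUE)
vw_models
```

Calling `droplevels()` matters here: without it, the table would still list the models of other brands with a count of zero.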
## automatik manuell
## 6434 30888 121169
The number of cars with manual gearbox is higher than the automatic ones. This is not surprising considering the ages and kilometers of the second-hand cars.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5000 125000 150000 128300 150000 150000
The concentration above 100.000 km, particularly at 150.000 km, is striking, so much so that scale_y_log10() becomes necessary to see the rest of the distribution.
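The following sketch illustrates why a log scale helps, using toy counts that are heavily concentrated at 150000 km. The plotting part assumes the ggplot2 package is installed and is skipped otherwise; the axis labels are illustrative.

```r
# Toy kilometer values, dominated by the 150000 km bucket.
kilometer <- c(rep(150000, 1000), rep(125000, 150),
               rep(50000, 40), rep(5000, 3))

# Counts per kilometer value.
km_counts <- table(kilometer)
km_counts

# The largest bucket is hundreds of times larger than the smallest,
# so the small buckets vanish on a linear y axis.
max(km_counts) / min(km_counts)

# Optional plot with a log-scaled y axis (assumes ggplot2 is available).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(kilometer = kilometer),
                       ggplot2::aes(x = factor(kilometer))) +
    ggplot2::geom_bar() +
    ggplot2::scale_y_log10() +
    ggplot2::labs(x = "Kilometer", y = "Number of cars")
  # print(p) renders the plot.
}
```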
## andere benzin cng diesel elektro hybrid lpg
## 9240 75 99033 277 47232 32 110 2492
There is a lot of missing data in this column (9240 entries have an empty fuel type), so we need to use subset() to exclude them.
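A minimal sketch of that subsetting step, assuming missing fuel types are stored as empty strings (as the unnamed first column of the table suggests). The rows are toy data.

```r
# Toy data: two rows have an empty fuel type.
cars <- data.frame(
  fuelType = factor(c("benzin", "", "diesel", "benzin", "", "lpg")),
  price    = c(1500, 900, 7200, 2950, 480, 3600)
)

# Drop rows with an empty fuel type, then remove the unused "" level.
with_fuel <- droplevels(subset(cars, fuelType != ""))

table(with_fuel$fuelType)
```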
[Plot: "Kilometer Diagram"; x-axis: "Kilometer", y-axis: "number of car"]
The majority of second-hand cars run on benzin (petrol) or diesel, as expected, but hybrid, elektro and cng cars point to the emergence of new trends in the second-hand car market.
As we can see from the table, the correlation between powerPS and price is stronger than the others. There is also a fairly good correlation between price and kilometer.
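The correlation table itself is not reproduced in this text. A sketch of how such a table can be computed with `cor()` is shown below; it uses only the first ten rows visible in the structure listing earlier, so the resulting numbers are not the ones from the full analysis.

```r
# First ten values of each column, copied from the structure listing.
cars <- data.frame(
  price              = c(480, 18300, 9800, 1500, 3600, 650, 2200, 0, 14500, 999),
  powerPS            = c(0, 190, 163, 75, 69, 102, 109, 50, 125, 101),
  kilometer          = c(150000, 125000, 125000, 150000, 90000,
                         150000, 150000, 40000, 30000, 150000),
  yearOfRegistration = c(1993, 2011, 2004, 2001, 2008, 1995, 2004, 1980, 2014, 1998)
)

# Pairwise Pearson correlations between the numeric columns.
cor_matrix <- cor(cars)
round(cor_matrix, 2)
```

Pearson correlation is sensitive to outliers, so on the real data it should be computed after the cleaning steps.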
I was expecting newer cars to have higher prices, but some cars still have a high price even though they are quite old. Those might be special cars such as classics.
The relationship between these two is very clear, as expected: cars with high horsepower are usually expensive.
I was expecting the price to be high when the kilometer count is low, but again some cars behave differently, and again those cars might be classics. To understand this situation we need to add another variable.
As we can see, cars with more horsepower are more likely to have a higher kilometer count, or their owners tend to sell them at a higher kilometer count.
We observe that SUV is the most expensive vehicle type while kleinwagen is the cheapest. However, kleinwagen have many outliers which may signify either user error or specific higher end brand and model combination.
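A sketch of how the comparison could be checked numerically: median price per vehicle type with `tapply()`, and boxplot-style outliers for kleinwagen with `boxplot.stats()`. The prices are toy data chosen to include one expensive small car.

```r
# Toy data: three SUVs and six small cars, one of them unusually expensive.
cars <- data.frame(
  vehicleType = factor(c("suv", "suv", "suv",
                         "kleinwagen", "kleinwagen", "kleinwagen",
                         "kleinwagen", "kleinwagen", "kleinwagen")),
  price = c(12000, 15000, 18000, 900, 1200, 1500, 1800, 2000, 14000)
)

# Median price per vehicle type, cheapest first.
median_price <- sort(tapply(cars$price, cars$vehicleType, median))
median_price

# Points beyond 1.5 times the hinge spread, i.e. what a boxplot draws as outliers.
klein <- cars$price[cars$vehicleType == "kleinwagen"]
outliers <- boxplot.stats(klein)$out
outliers
```

Inspecting the brand and model of the outlier rows is one way to tell data-entry errors apart from genuinely high-end small cars.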
There is a linear correlation between engine power (powerPS) and price in each vehicle type, but above 150 PS non-linearities become visible.
As expected, the second-hand cars with automatic gearbox are more expensive than manual ones.
As we can see, purple is more intense at lower prices, so manual cars are more likely to be cheaper. In the 2500 to 5000 range, however, automatic cars take the lead, and after around 50000 both gearbox types behave the same way.
Kleinwagen are small cars with small engines, so it is no surprise that they are cheap. Also, andere is the German word for "other", so many different car models can fall into that category. It is hard to trust that line since we do not know what kind of cars they are.
Limousine, kombi and kleinwagen are the most popular vehicle types in the second-hand market. The most expensive cars are SUVs, while the cheapest are kleinwagen. On average, the kleinwagen vehicle type is the cheapest and has the lowest engine power, but it also shows the most outliers, possibly as a result of brand and model diversity.
The most popular brands are Volkswagen, BMW, Opel, Mercedes, Audi, Ford, Renault, Peugeot, Fiat and Seat. These 10 brands account for almost 80% of the cars (the dataset originally contains around 40 brands).
According to our regression analysis, age (39%), kilometer (23%) and engine power (19%) are the most important factors explaining the second-hand price. Most of the cars in the second-hand market have more than 100.000 km, many even 150.000 km. According to our dataset, people do not change cars frequently.
The majority of the second-hand cars are sold within 35 days. The share sold in the first 10 days (day 0 stands for a same-day sale) is quite high. This shows us that either Ebay-Kleinanzeigen is very successful at targeting customers or the second-hand market is more fluid than we thought. To our surprise, there is no strong or significant correlation between selling time and vehicle type, kilometer or price. We saw that as the price goes up, the chance of being sold in 10-20 days (rather than 0-10 days) increases, especially for SUVs, but this is not a general trend. The second-hand market for hybrid (electric engine, CNG) cars is emerging but shows a trend towards longer selling times.
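The share of the ten most popular brands can be checked directly from the brand table shown earlier. The sketch below copies those counts into a named vector and computes the share; the variable names are illustrative.

```r
# Brand counts copied from the brand table in this report.
brand_counts <- c(
  alfa_romeo = 1041, audi = 13535, bmw = 17128, chevrolet = 811, chrysler = 649,
  citroen = 2294, dacia = 431, daewoo = 233, daihatsu = 327, fiat = 4202,
  ford = 11002, honda = 1232, hyundai = 1637, jaguar = 222, jeep = 305,
  kia = 1124, lada = 89, lancia = 198, land_rover = 265, mazda = 2550,
  mercedes_benz = 14805, mini = 1493, mitsubishi = 1295, nissan = 2197, opel = 17150,
  peugeot = 5112, porsche = 348, renault = 7693, rover = 208, saab = 240,
  seat = 3007, skoda = 2658, smart = 2540, sonstige_autos = 1175, subaru = 333,
  suzuki = 1059, toyota = 2172, trabant = 211, volkswagen = 34001, volvo = 1519
)

# Ten largest brands and their combined share of all listings.
top10 <- sort(brand_counts, decreasing = TRUE)[1:10]
share <- sum(top10) / sum(brand_counts)

names(top10)
round(100 * share, 1)  # 80.5
```

With these counts the top ten brands cover 127635 of 158491 listings, about 80.5%, in line with the figure quoted in the summary.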
Personally, I had a hard time because the data was about Germany's used cars, and I am from Turkey and living in the USA, so I did not have any knowledge of the German market. But it was an awesome experience to start out knowing very little and figure things out from the raw data.